# White Wine Quality Exploration by Yilin Du

The data set includes 4898 white wine samples. For each record, the inputs are objective tests (e.g. pH values) and the output is a wine quality score between 0 (very bad) and 10 (excellent), graded by wine experts.

Univariate Plots Section

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

There are 11 input variables associated with white wine quality, and all of them are numeric.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Most of the wines are rated between 5 and 7. Only 5 samples are rated 9, and no wine receives the full score of 10.
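To put a number on "most", the counts in the table can be summed directly. The following is a minimal Python sketch (the report's own analysis was done in R) that reuses the counts printed above; the variable names are illustrative only.

```python
from collections import Counter

# Quality counts copied from the table above (score -> number of wines).
quality_counts = Counter({3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5})

total = sum(quality_counts.values())
# Wines rated 5, 6 or 7.
mid_range = sum(n for score, n in quality_counts.items() if 5 <= score <= 7)
share_mid = mid_range / total

print(f"{mid_range} of {total} wines ({share_mid:.1%}) are rated 5 to 7")
```

Under these counts, roughly nine out of ten wines fall in the 5 to 7 range.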

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The pH of the wines is approximately normally distributed.

The wines have quite a narrow density range: most of them fall between 0.99 and 1.00. When the binwidth is very small (0.00001), it is hard to discern the distribution pattern of the density.

With a larger binwidth, we find that the density is also approximately normally distributed. Because the range of density is very narrow, we divided the density into groups as follows.

From the density group plot, we can easily see that most of the wines have a density between 0.9917 and 0.9967; only a very small number of wines have a density larger than 1.002.

## 
## (0.9917,0.9967]  (0.9967,1.002]   (1.002,1.007]   (1.007,1.012] 
##            2675             995               6               2 
##   (1.012,1.017]   (1.017,1.022]   (1.022,1.027]   (1.027,1.032] 
##               0               0               0               0 
##   (1.032,1.037]            <NA> 
##               0            1220

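The table also lists 1220 of 4898 wines as NA. These are not missing measurements: the summary above reports a density for every sample, and the lowest bin starts at 0.9917 (the first quartile), so wines with a density at or below that value fall outside every interval. The density groups therefore cover only about three quarters of the samples, which should be kept in mind when density.group is used in the following analysis.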
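The interval labels in the table, such as (0.9917,0.9967], are left-open and right-closed, which is the default behaviour of R's `cut()`; a value that lies in no interval is labelled NA. A minimal Python sketch of that binning rule follows. The helper `cut_right_closed` is hypothetical, and the breaks (start 0.9917, step 0.005, nine intervals) are an assumption read off the table labels.

```python
def cut_right_closed(value, breaks):
    """Return the (lo, hi] interval that contains value, or None if there is none."""
    for lo, hi in zip(breaks, breaks[1:]):
        if lo < value <= hi:
            return (lo, hi)
    return None  # plays the role of NA in the R output


# Assumed breaks: ten edges giving nine intervals, as in the table above.
breaks = [round(0.9917 + 0.005 * i, 4) for i in range(10)]

# 0.9871 is the minimum density from the summary; 0.994 and 1.001 are sample rows.
for density in (0.9871, 0.9917, 0.994, 1.001):
    print(density, cut_right_closed(density, breaks))
```

Because the first interval is open on the left, a density of exactly 0.9917 is unbinned as well, just like every value below it.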

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The distribution of residual.sugar looks different from the other variables we have explored so far. It does not follow a normal distribution in the above plot.

To further explore the residual.sugar variable, we change the x scale to log10. The distribution then splits into two groups (low and high), and each group looks roughly normally distributed.
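Why the log10 scale helps can be seen from the summary statistics alone. The Python sketch below (illustrative only, not the report's R code) compares where the median of residual.sugar sits within the raw range and within the log10 range, using the minimum, median and maximum from the summary output.

```python
import math

# Minimum, median and maximum of residual.sugar, taken from the summary output.
residual_sugar = {"min": 0.6, "median": 5.2, "max": 65.8}
log_sugar = {name: math.log10(value) for name, value in residual_sugar.items()}


def relative_position(stats):
    """Position of the median within [min, max], as a fraction between 0 and 1."""
    return (stats["median"] - stats["min"]) / (stats["max"] - stats["min"])


raw_pos = relative_position(residual_sugar)
log_pos = relative_position(log_sugar)
print(f"median sits at {raw_pos:.0%} of the raw axis, {log_pos:.0%} of the log10 axis")
```

On the raw axis half of the wines are squeezed into the leftmost few percent of the plot; on the log10 axis the same half spreads over nearly half of the axis, which is what makes the two groups visible.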

The above analysis also reveals outliers in residual.sugar (a few wine samples have very large residual.sugar compared with the others). In this case, a boxplot is a good way to depict the outliers of the variable.
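A boxplot conventionally flags points that lie more than 1.5 times the interquartile range beyond the quartiles. As a rough Python sketch of that rule, assuming the quartiles from the summary output are close enough to the hinges a boxplot would use:

```python
# 1st and 3rd quartiles of residual.sugar, taken from the summary output.
q1, q3 = 1.7, 9.9
iqr = q3 - q1

# Conventional boxplot fences: 1.5 * IQR beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr


def is_outlier(value):
    """True if value would be drawn as an individual point beyond the whiskers."""
    return value < lower_fence or value > upper_fence


print(f"fences: [{lower_fence:.1f}, {upper_fence:.1f}]")
print("max 65.8 is an outlier:", is_outlier(65.8))
print("20.7 is an outlier:", is_outlier(20.7))
```

With these quartiles the upper fence is about 22.2, so the maximum of 65.8 lies far outside it while a value such as 20.7 does not.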

To relate free.sulfur.dioxide to total.sulfur.dioxide, we create another variable, “ratio.sulfur.dioxide”, defined as the ratio of free to total sulfur dioxide.
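The derived variable is a simple element-wise division. A small Python sketch, using the first four rows of free.sulfur.dioxide and total.sulfur.dioxide from the structure output at the top of the section (the list names are illustrative):

```python
# First four rows of the two columns, copied from the str() output.
free_sulfur_dioxide = [45, 14, 30, 47]
total_sulfur_dioxide = [170, 132, 97, 186]

# Dimensionless ratio of free to total sulfur dioxide, rounded to 3 decimals.
ratio_sulfur_dioxide = [
    round(free / total, 3)
    for free, total in zip(free_sulfur_dioxide, total_sulfur_dioxide)
]

print(ratio_sulfur_dioxide)  # [0.265, 0.106, 0.309, 0.253]
```

Since free sulfur dioxide is part of the total, the ratio lies between 0 and 1 for every wine.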

Alcohol does not look normally distributed, so we divide it into groups as follows. Only 2 records end up as NA; since the summary above shows no missing alcohol values, these are most likely wines at the minimum of 8.0, which the left-open interval (8,10] excludes.

##  (8,10] (10,12] (12,14] (14,16]    NA's 
##    2083    2102     709       2       2

Univariate Analysis

What is the structure of your dataset?

The data set includes 4898 white wine samples. For each record, the inputs are 11 objective tests (e.g. pH values) and the output is a wine quality score between 0 (very bad) and 10 (excellent), graded by wine experts.

What is/are the main feature(s) of interest in your dataset?

The wine quality is the main feature of interest. From the dataset, we want to explore the relation between different objective tests and the associated quality score.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

We explore the distributions of different variables, e.g. pH, density, alcohol, residual sugar, free sulfur dioxide and total sulfur dioxide. These are features of the wine, and we are trying to determine which of them contribute to a higher quality score.

Did you create any new variables from existing variables in the dataset?

I created a new variable, ratio.sulfur.dioxide, defined as the ratio of free.sulfur.dioxide to total.sulfur.dioxide. This dimensionless variable helps us understand the relation between free.sulfur.dioxide and total.sulfur.dioxide, and it makes the sulfur dioxide content comparable between different wines.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The distribution of residual.sugar looks different from the other variables we explored. It did not follow a normal distribution when I first plotted it. When I changed the x scale to log10, I found it has two groups, each of which is roughly normally distributed.

Bivariate Plots Section

## 'data.frame':    4898 obs. of  17 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ density.group       : Factor w/ 9 levels "(0.9917,0.9967]",..: 2 1 1 1 1 1 1 2 1 1 ...
##  $ ratio.sulfur.dioxide: num  0.265 0.106 0.309 0.253 0.253 ...
##  $ alcohol.group       : Factor w/ 4 levels "(8,10]","(10,12]",..: 1 1 2 1 1 2 1 1 1 2 ...
##  $ quality_group       : Factor w/ 2 levels "(2,6]","(6,9]": 1 1 1 1 1 1 1 1 1 1 ...

Let’s first take a look at all the test variables, including the ratio.sulfur.dioxide and alcohol.group variables we defined in the previous section. The variables we want to explore further are: fixed.acidity, residual.sugar, pH, alcohol, quality, density.group, ratio.sulfur.dioxide and alcohol.group.

We’ll take a closer look at the plots above in the following sections.

From the fixed.acidity vs. quality plot, we can tell that the higher quality wines have a slightly lower fixed.acidity. However, plotting a single variable against quality does not seem to be an appropriate choice for the bivariate plots: from the following plots, we cannot really summarize the relation between any one variable alone and the quality score.

Let’s now look at the relation between two variables. The first plot is of residual.sugar against alcohol. Most of the points are on the left of the plot because only a few samples have a large residual.sugar value. To better display the distribution, we limit the following plot to residual.sugar values up to the 99% quantile.
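Trimming at the 99% quantile can be sketched in Python as follows. The sample is only illustrative: it combines the first ten residual.sugar values from the structure output with the maximum from the summary, and the "inclusive" interpolation method is an assumption rather than the method the report used.

```python
import statistics

# Illustrative sample: first ten residual.sugar rows plus the overall maximum.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5, 65.8]

# quantiles(n=100) returns the 99 percentile cut points; index 98 is the 99th.
p99 = statistics.quantiles(residual_sugar, n=100, method="inclusive")[98]

# Keep only the values at or below the 99% quantile for plotting.
trimmed = [value for value in residual_sugar if value <= p99]

print(f"99% quantile: {p99:.2f}, kept {len(trimmed)} of {len(residual_sugar)} values")
```

Only the extreme value is dropped, so the remaining points can use the full width of the x axis.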

We can see a relation between alcohol and residual.sugar when residual.sugar is larger than 2: a larger residual.sugar value is associated with a relatively lower alcohol content.

## 
##  Pearson's product-moment correlation
## 
## data:  density and alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

There is a strong, roughly linear negative relation between density and alcohol: the sample estimate of the correlation is -0.78.
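The numbers in the test output are tied together by standard formulas: the t statistic follows from the correlation and the degrees of freedom, and the confidence interval from the Fisher z-transform. The Python sketch below (function names are illustrative) recomputes both from the reported correlation and shows a plain Pearson correlation on the first five density and alcohol rows of the structure output.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, df):
    """t statistic for testing that the true correlation is zero."""
    return r * math.sqrt(df) / math.sqrt(1 - r * r)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval via the Fisher z-transform."""
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


r, n = -0.7801376, 4898  # reported correlation and sample size
t = t_statistic(r, n - 2)
low, high = fisher_ci(r, n)
print(f"t = {t:.3f}, 95% CI = ({low:.4f}, {high:.4f})")

# First five rows only, so this value is not expected to match the full-data result.
density = [1.001, 0.994, 0.995, 0.996, 0.996]
alcohol = [8.8, 9.5, 10.1, 9.9, 9.9]
print(f"r on five sample rows = {pearson(density, alcohol):.2f}")
```

The very large t value reflects the sample size as much as the strength of the relation: with 4898 samples, even a much weaker correlation would be statistically significant.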

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

From the fixed.acidity vs. quality plot, we can tell that the higher quality wines have a slightly lower fixed.acidity.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I observed a strong, almost linear correlation between density and alcohol. Higher density is associated with lower alcohol.

What was the strongest relationship you found?

It’s between density and alcohol, as described above.

Multivariate Plots Section

## 'data.frame':    4898 obs. of  17 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ density.group       : Factor w/ 9 levels "(0.9917,0.9967]",..: 2 1 1 1 1 1 1 2 1 1 ...
##  $ ratio.sulfur.dioxide: num  0.265 0.106 0.309 0.253 0.253 ...
##  $ alcohol.group       : Factor w/ 4 levels "(8,10]","(10,12]",..: 1 1 2 1 1 2 1 1 1 2 ...
##  $ quality_group       : Factor w/ 2 levels "(2,6]","(6,9]": 1 1 1 1 1 1 1 1 1 1 ...

From the previous section, we see that density and alcohol are closely correlated. We add a third variable, quality, to the plot. The picture is not clear with so many colors, but it seems that the higher quality samples (purple and pink) are in the upper-left corner, which indicates lower density and higher alcohol.

To reduce the number of colors in the plot, we create quality_group. This plot shows clearly that the wines with higher quality scores (green dots) have lower density and higher alcohol than the low quality group.

The same quality grouping is applied to the alcohol vs. residual.sugar plot. No obvious pattern emerges here.
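The two levels of quality_group, (2,6] and (6,9], split the scores into a low and a high group. A minimal Python sketch of that grouping, applied to the quality counts from the univariate section (the function name is illustrative):

```python
def quality_group(score):
    """Map a quality score to the two-level grouping shown in the str() output."""
    if 2 < score <= 6:
        return "(2,6]"
    if 6 < score <= 9:
        return "(6,9]"
    return None  # outside both intervals


# Quality counts from the univariate section (score -> number of wines).
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

group_counts = {}
for score, n in quality_counts.items():
    group = quality_group(score)
    group_counts[group] = group_counts.get(group, 0) + n

print(group_counts)  # {'(2,6]': 3838, '(6,9]': 1060}
```

The groups are unbalanced: under these counts, a little over a fifth of the wines fall in the high quality group.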

## # A tibble: 6 × 3
##   quality alcohol.group     n
##     <int>        <fctr> <int>
## 1       3        (8,10]     7
## 2       3       (10,12]    10
## 3       3       (12,14]     2
## 4       4        (8,10]    81
## 5       4       (10,12]    74
## 6       4       (12,14]     8

For each alcohol.group, we compare the distribution of the quality score. They all look roughly normal. However, the center of the distribution for the high alcohol group tends to shift right in the above plot.

## # A tibble: 6 × 4
##   quality alcohol.group     n       freq
##     <int>        <fctr> <int>      <dbl>
## 1       3        (8,10]     7 0.36842105
## 2       3       (10,12]    10 0.52631579
## 3       3       (12,14]     2 0.10526316
## 4       4        (8,10]    81 0.49693252
## 5       4       (10,12]    74 0.45398773
## 6       4       (12,14]     8 0.04907975

From the frequency bar plot, we can clearly see that the higher alcohol groups make up an increasing proportion of the wines as the quality score increases.
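The freq column is each count divided by the total for its quality score, so the frequencies within one score sum to 1. A short Python sketch that reproduces the six rows shown above from their counts (the dictionary layout is illustrative):

```python
from collections import defaultdict

# (quality, alcohol.group) -> n, copied from the six rows shown above.
counts = {
    (3, "(8,10]"): 7, (3, "(10,12]"): 10, (3, "(12,14]"): 2,
    (4, "(8,10]"): 81, (4, "(10,12]"): 74, (4, "(12,14]"): 8,
}

# Total number of wines per quality score.
totals = defaultdict(int)
for (quality, _group), n in counts.items():
    totals[quality] += n

# Share of each alcohol group within its quality score.
freq = {key: n / totals[key[0]] for key, n in counts.items()}

for (quality, group), value in freq.items():
    print(quality, group, round(value, 8))
```

Normalizing within each quality score is what makes the bars comparable, since the scores differ greatly in how many wines they contain.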

## # A tibble: 6 × 4
##   quality   density.group     n      freq
##     <int>          <fctr> <int>     <dbl>
## 1       3 (0.9917,0.9967]     9 0.5625000
## 2       3  (0.9967,1.002]     7 0.4375000
## 3       4 (0.9917,0.9967]   108 0.7883212
## 4       4  (0.9967,1.002]    29 0.2116788
## 5       5 (0.9917,0.9967]   920 0.6789668
## 6       5  (0.9967,1.002]   432 0.3188192

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Wines with higher quality scores generally have lower density and higher alcohol than low quality wines.

Were there any interesting or surprising interactions between features?

The higher alcohol groups make up an increasing proportion of the wines as the quality score increases.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The distribution of residual sugar in the wines is bimodal on a log scale. Within the low residual sugar group, the values appear roughly normally distributed.

Plot Two

Description Two

We plot the relation between residual sugar and alcohol for all the wine samples. We can see these two variables are closely correlated, with an almost linear relation. We then assign different colors to wines of different quality. The wines are divided into 2 quality groups: quality scores between 3 and 6 (low quality) and scores between 7 and 9 (high quality). An interesting finding is that high quality wines are more likely to be high in alcohol and low in residual sugar.

Plot Three

Description Three

In this plot, we explore the relation between alcohol and the wine quality scores. We divided the alcohol into 4 groups (8-10%, 10-12%, 12-14% and 14-16%). Only a very tiny portion of the wines at quality 7 fall in the 14-16% group, so we ignore this group and focus on the other 3 alcohol groups. For the 10-12% alcohol group, the proportion remains similar across all wine qualities. However, the higher alcohol group (12-14%) makes up an increasing portion of the wine samples as wine quality increases.


Reflection

The white wine dataset includes 4898 samples. Each sample has 11 variables describing the wine, such as pH, density and alcohol, along with a quality score between 0 and 10.

In the first section, I plotted the frequency distribution of each variable to look for abnormal distributions. Most of the variable distributions appeared more or less normal. However, the first plot for residual sugar was not clear, so I changed the x scale to log10 and found that the distribution of residual sugar in the wines is bimodal on a log scale.

Then I explored the relation between pairs of variables. This was where I ran into difficulties. I couldn’t find any interesting relations when I plotted each variable against the quality score in one figure, mainly because the quality score is a discrete integer between 0 and 10. I decided to explore the relation between the variables and the quality scores in the multivariate plots section instead of the bivariate plots. However, I did find an interesting, almost linear relation between residual sugar and alcohol. When I added a third variable (quality group) to the plot, I found that high quality wines are more likely to be high in alcohol and low in residual sugar. For the multivariate plots section, I started with some interesting plots from the previous section and added a third variable to them.

There is definitely a lot more that can be done with this dataset. In the first section, 1220 of 4898 wines were not assigned to any density group. The density values themselves are complete, so redefining the bins to start at the minimum density would let us explore the influence of density on the quality score across all samples. There are so many variables in the dataset that I did not arrive at a correlation between the individual variables and the quality scores. It would be great to work further and create a quality score prediction model based on the given variables.